Exploratory Analysis of Seattle Collisions by Leah Erb

This report explores collision reports from Seattle, Washington, USA.

The Dataset

The Seattle Collisions dataset is a compilation of over 200,000 collision reports created by Seattle Police Department (SPD) that were then recorded by Seattle Department of Transportation (SDOT), between the years 2004 and 2018.

#
# Load data
#

# manually toggle datafile as local or live source
datafile = 'Collisions.csv'
#datafile = 'http://data-seattlecitygis.opendata.arcgis.com/datasets/5b5c745e0f1f48e7a53acec63a0022ab_0.csv'

# import empty strings as NULL
df_raw = read.csv(datafile, na.strings=c(""," "), header=TRUE)
# df = working dataset (to be cleaned); df_raw = data as loaded
df <- data.frame(df_raw)

Assess raw data

glimpse(df_raw)
Observations: 205,877
Variables: 38
$ OBJECTID        <int> 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, …
$ INCKEY          <int> 1358, 1181, 60000, 37500, 1350, 50000, 25300, 40…
$ COLDETKEY       <int> 1358, 1181, 60000, 37500, 1350, 50000, 25300, 40…
$ REPORTNO        <fct> 3568600, 3501659, 2607865, 1797759, 3380693, 218…
$ STATUS          <fct> Matched, Matched, Matched, Matched, Matched, Mat…
$ ADDRTYPE        <fct> Block, Block, Block, Intersection, Intersection,…
$ INTKEY          <int> NA, NA, NA, 34562, 35095, NA, NA, NA, 36052, 270…
$ LOCATION        <fct> GREENWOOD AVE N BETWEEN N 134TH ST AND N 136TH S…
$ EXCEPTRSNCODE   <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ EXCEPTRSNDESC   <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ SEVERITYCODE    <fct> 1, 1, 1, 2, 2, 1, 2, 1, 2, 2, 1, 1, 1, 2, 1, 1, …
$ SEVERITYDESC    <fct> Property Damage Only Collision, Property Damage …
$ COLLISIONTYPE   <fct> Left Turn, Sideswipe, Other, Angles, Cycles, Par…
$ PERSONCOUNT     <int> 2, 3, 2, 4, 5, 3, 2, 2, 2, 2, 2, 2, 1, 3, 2, 2, …
$ PEDCOUNT        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ PEDCYLCOUNT     <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ VEHCOUNT        <int> 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 1, 3, 2, 2, …
$ INJURIES        <int> 0, 0, 0, 1, 1, 0, 2, 0, 2, 1, 0, 0, 0, 1, 0, 0, …
$ SERIOUSINJURIES <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ FATALITIES      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ INCDATE         <fct> 2013-04-02T00:00:00.000Z, 2013-03-30T00:00:00.00…
$ INCDTTM         <fct> 4/2/13 15:10, 3/30/13 14:00, 10/24/06 10:01, 7/2…
$ JUNCTIONTYPE    <fct> Mid-Block (not related to intersection), Mid-Blo…
$ SDOT_COLCODE    <int> 11, 11, 11, 11, 51, 11, 14, 11, 11, 14, 11, 11, …
$ SDOT_COLDESC    <fct> "MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END A…
$ INATTENTIONIND  <fct> NA, NA, NA, NA, Y, NA, Y, Y, NA, Y, NA, NA, Y, N…
$ UNDERINFL       <fct> N, N, 0, 0, N, 0, 0, 0, N, 0, N, 0, 0, 0, N, 0, …
$ WEATHER         <fct> Overcast, Clear or Partly Cloudy, Raining, Clear…
$ ROADCOND        <fct> Dry, Dry, Wet, Dry, Dry, Dry, Dry, Wet, Dry, Dry…
$ LIGHTCOND       <fct> Daylight, Daylight, Daylight, Daylight, Dusk, Da…
$ PEDROWNOTGRNT   <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ SDOTCOLNUM      <int> NA, NA, 6297026, 5209012, NA, 6052020, 4274006, …
$ SPEEDING        <fct> NA, NA, Y, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ ST_COLCODE      <int> 28, 11, 23, 10, 45, 19, 14, 71, 28, 14, 20, 71, …
$ ST_COLDESC      <fct> From opposite direction - one left turn - one st…
$ SEGLANEKEY      <int> 0, 0, 0, 0, 29054, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ CROSSWALKKEY    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ HITPARKEDCAR    <fct> N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, …

What are the min and max INCDATEs?

# function to print min/max INCDATE of given dataframe
print_date_range <- function(this_df){
  this_df %>%
    summarize(min_INCDATE = min(as.Date(this_df$INCDATE)),
              max_INCDATE = max(as.Date(this_df$INCDATE)))
}
print_date_range(df)
  min_INCDATE max_INCDATE
1  2003-10-06  2019-01-31

Which columns have NULL values, and how many?

# helper function: print columns with NULLs and their counts
count_nulls <- function(this_df) {
  df_nulls <- sapply(df, function(x) sum(is.na(x)))
  as.data.frame(df_nulls[df_nulls>0]) %>% rename_at(1, ~'null_count')
}

count_nulls(df_raw)
               null_count
ADDRTYPE             3608
INTKEY             139484
LOCATION             4420
EXCEPTRSNCODE      195531
EXCEPTRSNDESC      195531
SEVERITYCODE            1
COLLISIONTYPE       24275
JUNCTIONTYPE        11002
SDOT_COLCODE            1
SDOT_COLDESC            1
INATTENTIONIND     177916
UNDERINFL           24254
WEATHER             24423
ROADCOND            24371
LIGHTCOND           24517
PEDROWNOTGRNT      201022
SDOTCOLNUM          78672
SPEEDING           196602
ST_COLCODE          24275
ST_COLDESC          24275

Observation: date and datetime columns have no NULL values.

Several columns have only 1 NULL, are they all from the same record?

cols <-  c('SEVERITYCODE','SDOT_COLCODE','SDOT_COLDESC')
subset(df, is.na(SEVERITYCODE) | is.na(SDOT_COLCODE) | is.na(SDOT_COLDESC), cols)
       SEVERITYCODE SDOT_COLCODE
117               0           NA
197563         <NA>           18
                                                SDOT_COLDESC
117                                                     <NA>
197563 MOTOR VEHICLE STRUCK PEDALCYCLIST, FRONT END AT ANGLE

Observation: No.

Can we deduce a value for the NULL SDOT_COLCODE, given other values in the record?

subset(df, is.na(SDOT_COLCODE))
    OBJECTID INCKEY COLDETKEY REPORTNO  STATUS     ADDRTYPE INTKEY
117     2117   3663      3663  1787335 Matched Intersection  29797
                  LOCATION EXCEPTRSNCODE EXCEPTRSNDESC SEVERITYCODE
117 BROADWAY AND CHERRY ST          <NA>          <NA>            0
    SEVERITYDESC COLLISIONTYPE PERSONCOUNT PEDCOUNT PEDCYLCOUNT VEHCOUNT
117      Unknown          <NA>           0        0           0        0
    INJURIES SERIOUSINJURIES FATALITIES                  INCDATE INCDTTM
117        0               0          0 2003-10-06T00:00:00.000Z 10/6/03
    JUNCTIONTYPE SDOT_COLCODE SDOT_COLDESC INATTENTIONIND UNDERINFL
117         <NA>           NA         <NA>           <NA>      <NA>
    WEATHER ROADCOND LIGHTCOND PEDROWNOTGRNT SDOTCOLNUM SPEEDING
117    <NA>     <NA>      <NA>             Y    3279003     <NA>
    ST_COLCODE ST_COLDESC SEGLANEKEY CROSSWALKKEY HITPARKEDCAR
117         NA       <NA>          0            0            N

Observation: No. There is very little information recorded. Interestingly, this record was the first recorded. It may have been a test insert?

Can we deduce a value for the NULL SEVERITYCODE, given other values in the record?

subset(df, is.na(SEVERITYCODE))
       OBJECTID INCKEY COLDETKEY REPORTNO  STATUS     ADDRTYPE INTKEY
197563   197563 270579    271399  3753394 Matched Intersection  29084
                             LOCATION EXCEPTRSNCODE EXCEPTRSNDESC
197563 EASTLAKE AVE E AND E NELSON PL          <NA>          <NA>
       SEVERITYCODE SEVERITYDESC COLLISIONTYPE PERSONCOUNT PEDCOUNT
197563         <NA>      Unknown          <NA>           2        0
       PEDCYLCOUNT VEHCOUNT INJURIES SERIOUSINJURIES FATALITIES
197563           1        1        0               0          0
                        INCDATE      INCDTTM
197563 2018-02-14T00:00:00.000Z 2/14/18 7:50
                                 JUNCTIONTYPE SDOT_COLCODE
197563 At Intersection (intersection related)           18
                                                SDOT_COLDESC
197563 MOTOR VEHICLE STRUCK PEDALCYCLIST, FRONT END AT ANGLE
       INATTENTIONIND UNDERINFL  WEATHER ROADCOND LIGHTCOND PEDROWNOTGRNT
197563           <NA>         N Overcast      Wet  Daylight          <NA>
       SDOTCOLNUM SPEEDING ST_COLCODE ST_COLDESC SEGLANEKEY CROSSWALKKEY
197563         NA     <NA>         NA       <NA>          0            0
       HITPARKEDCAR
197563            N

Observation: The NULL SEVERITYCODE could be changed to ‘0’ because INJURY, SERIOUSINJURIES and FATALITIES = 0

What do Date/Time columns look like?

INCDATE ...
 [1] 2013-04-02T00:00:00.000Z 2013-03-30T00:00:00.000Z
 [3] 2006-10-24T00:00:00.000Z 2005-07-28T00:00:00.000Z
 [5] 2013-04-02T00:00:00.000Z 2006-02-21T00:00:00.000Z
 [7] 2004-09-30T00:00:00.000Z 2005-06-07T00:00:00.000Z
 [9] 2013-04-02T00:00:00.000Z 2005-08-26T00:00:00.000Z
[11] 2013-04-02T00:00:00.000Z 2005-08-07T00:00:00.000Z
[13] 2005-11-17T00:00:00.000Z 2004-07-11T00:00:00.000Z
[15] 2013-04-05T00:00:00.000Z 2005-03-04T00:00:00.000Z
[17] 2013-04-02T00:00:00.000Z 2010-05-12T00:00:00.000Z
[19] 2013-04-02T00:00:00.000Z 2013-04-02T00:00:00.000Z
5507 Levels: 2003-10-06T00:00:00.000Z ... 2019-01-31T00:00:00.000Z
INCDTTM ...
 [1] 4/2/13 15:10   3/30/13 14:00  10/24/06 10:01 7/28/05 11:45 
 [5] 4/2/13 19:15   2/21/06 0:13   9/30/04 12:40  6/7/05        
 [9] 4/2/13 19:29   8/26/05 7:23   4/2/13 8:05    8/7/05        
[13] 11/17/05 8:56  7/11/04 17:25  4/5/13 7:31    3/4/05        
[17] 4/2/13 13:25   5/12/10 21:00  4/2/13         4/2/13 19:07  
155661 Levels: 1/1/04 1/1/05 1/1/06 1/1/07 1/1/08 1/1/09 ... 9/9/18 9:22

What do binary-esque columns look like?

# Helper function to return value frequency of given column.
count_pct <- function(df) {
  return(
    df %>% 
      tally() %>% 
      mutate(freq = (n_pct = n / sum(n)))
  )
}
# Show count and % of column values
cols <- c('UNDERINFL', 'INATTENTIONIND', 'SPEEDING', 'HITPARKEDCAR')
lapply(df[cols], function(x) {
  x <- enquo(x)
  df %>% group_by(fct_explicit_na(!! x)) %>% count_pct
  })
$UNDERINFL
# A tibble: 5 x 3
  `fct_explicit_na(...)`     n   freq
  <fct>                  <int>  <dbl>
1 0                      81692 0.397 
2 1                       4231 0.0206
3 N                      90957 0.442 
4 Y                       4743 0.0230
5 (Missing)              24254 0.118 

$INATTENTIONIND
# A tibble: 2 x 3
  `fct_explicit_na(...)`      n  freq
  <fct>                   <int> <dbl>
1 Y                       27961 0.136
2 (Missing)              177916 0.864

$SPEEDING
# A tibble: 2 x 3
  `fct_explicit_na(...)`      n   freq
  <fct>                   <int>  <dbl>
1 Y                        9275 0.0451
2 (Missing)              196602 0.955 

$HITPARKEDCAR
# A tibble: 2 x 3
  `fct_explicit_na(...)`      n   freq
  <fct>                   <int>  <dbl>
1 N                      197334 0.959 
2 Y                        8543 0.0415

For counts-of-things, what % are > 0?

cols <- c('PERSONCOUNT','PEDCOUNT','PEDCYLCOUNT','VEHCOUNT',
          'INJURIES','SERIOUSINJURIES','FATALITIES')
as.data.frame(lapply(df[,cols], 
                     function(x){
                       length(which(x!=0))/length(x) *100
                       } 
                     ))
  PERSONCOUNT PEDCOUNT PEDCYLCOUNT VEHCOUNT INJURIES SERIOUSINJURIES
1    88.57473 3.538521    2.661298 88.10649 27.90841        1.390636
  FATALITIES
1  0.1544612

 Assessment Observations

May require cleaning:

  • INCDATE is a factor with format ‘YYYY-MM-DDT00:00:00.000Z’
  • INCDTTM is a factor with varying formats of ‘M/DD/YY HH:MM’ and ‘M/DD/YY’
  • Minimum INCDATE = ‘2003-10-06T00:00:00.000Z’
  • SEVERITYCODE, SDOT_COLCODE and SDOT_COLDESC: each have only 1 NULL value.
  • The SEVERITYCODE=NA could be marked ‘0’ instead because INJURY, SERIOUSINJURIES and FATALITIES = 0
  • The only incident with no SDOT_COLCODE is also the first record (by timestamp). It may be a test insert.
  • SDOT_COLCODE contains categorical data but is datatype Integer.
  • Binary fields:
  • INATTENTIONIND appears to default to NULL unless specifically set to ‘Y’. ( >86% NULL)
  • SPEEDING appears to default to NULL unless specifically set to ‘Y’. (>95% NULL)
  • HITPARKEDCAR has no NULLs, all values are ‘N’ or ‘Y’. (>95% ‘N’)
  • UNDERINFL is all over the map, with 0, 1, N, Y and NULL values. (> 95% 0, N or NULL)
  • Some key columns may not be of use in our analysis since we not, at this time, joining with other tables: -OBJECTID, -REPORTNO, -STATUS, -EXCEPTRSNCODE, -EXCEPTRSNDESC, -COLDETKEY, -INCKEY, -INTKEY, -SDOTCOLNUM, -SEGLANEKEY

Of interest: * more than 88% collisions have non-zero values for PERSONCOUNT and VEHCOUNT * fewer than 28% collisions have non-zero values INJURIES * fewer than 5% of PEDCOUNT, PEDCYLCOUNT, SERIOUSINJURIES and FATALITIES are non-zero values


Clean

Date/Time columns

  • Change INCDATE datatype to Date

  • Create date/time columns:
    • Year (YYYY)
    • Month (1-12)
    • Time (HH:MM)
    • Hour (of day)
    • DayPart (Morning, Afternoon, Evening, Night)
    • Season (Spring, Summer, Fall, Winter)
# convert INCDATE to date datatype
df$INCDATE <- as.Date(df$INCDATE)

# create year and month columns
df$Year <- as.factor(lubridate::year(as.Date(df$INCDATE)))
df$Month <- as.factor(lubridate::month(as.Date(df$INCDATE)))


# create Time and Hour columns
df$Time <- as.POSIXct(word(df$INCDTTM, 2), format='%H:%M')
df$Hour <- as.integer(substr(df$Time, 12, 13))

# categorize time of day 
df$DayPart <- factor(with(df,
     ifelse(Hour >= 6 & Hour < 12, 'Morning',
     ifelse(Hour >= 12 & Hour < 17, 'Afternoon',
     ifelse(Hour >= 17 & Hour < 20, 'Evening',
     ifelse(Hour >= 20 | Hour < 6, 'Night',
            'Unknown'))))))

# categorize Season
df$Season <- factor(with(df,
     ifelse(Month %in% c('3','4','5'), 'Spring', 
     ifelse(Month %in% c('6','7','8'), 'Summer', 
     ifelse(Month %in% c('9','10','11'), 'Fall', 
     ifelse(Month %in% c('12','1','2'), 'Winter', 
            'Unknown'))))))

Simplify and standardize

  • Rename columns
  • UNDERINFL -> DUI
  • INATTENTIONID -> DISTRACTED
  • Group common SDOTtypes (e.g., combine multiple ‘%Sideswipe%’ types into one named ‘Sideswipe’)
  • Shorten SEVERITYDESC values by removing ‘Collision’ from strings
  • Clean binary variables so all values are either Y or N. Change NULL to N by default, meaning that the attribute was NOT (=‘N’) marked as a feature of the collision
  • Change SEVERITYCODE to 0 where NULL
  • Change factors to ‘Unknown’ where NULL (WEATHER, ROADCOND, LIGHTCOND)

Fix and trim

  • Delete the first record by INCDATE (it appears to have been a test insert)
  • Change NULL SEVERITYCODE to 0
  • Remove columns not of interest at this time
  • Create df_full_yrs to exclude the latest year if it does not yet contain a full-year’s worth of data
# delete first (test) record from 2003
df_backup <- data.frame(df)

df <- subset(df, Year != '2003')

# change NULL SEVERITYCODE to 0
df$SEVERITYCODE[is.na(df$SEVERITYCODE)] <- 0

# remove columns not of interest at this time
df <- df %>%
  select(-OBJECTID, -REPORTNO, -STATUS, -EXCEPTRSNCODE, -EXCEPTRSNDESC,
         -COLDETKEY, -INCKEY, -INTKEY, -SDOTCOLNUM, -SEGLANEKEY, -CROSSWALKKEY, 
         -PEDROWNOTGRNT, -LOCATION, -HITPARKEDCAR, -ST_COLCODE, -ST_COLDESC,
         -SDOT_COLDESC)

# Create df_trim_yrs to contain only complete year's worth of data
# TODO: include logic, remove hardcode
df_full_yrs <- subset(df, Year != '2019')

# Verify
print_date_range(df)
  min_INCDATE max_INCDATE
1  2004-01-01  2019-01-31
print_date_range(df_full_yrs)
  min_INCDATE max_INCDATE
1  2004-01-01  2018-12-31

Data structure after cleaning

glimpse(df)
Observations: 205,876
Variables: 28
$ ADDRTYPE        <fct> Block, Block, Block, Intersection, Intersection,…
$ SEVERITYCODE    <fct> 1, 1, 1, 2, 2, 1, 2, 1, 2, 2, 1, 1, 1, 2, 1, 1, …
$ SEVERITYDESC    <fct> Property Damage Only, Property Damage Only, Prop…
$ COLLISIONTYPE   <fct> Left Turn, Sideswipe, Other, Angles, Cycles, Par…
$ PERSONCOUNT     <int> 2, 3, 2, 4, 5, 3, 2, 2, 2, 2, 2, 2, 1, 3, 2, 2, …
$ PEDCOUNT        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ PEDCYLCOUNT     <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ VEHCOUNT        <int> 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 1, 3, 2, 2, …
$ INJURIES        <int> 0, 0, 0, 1, 1, 0, 2, 0, 2, 1, 0, 0, 0, 1, 0, 0, …
$ SERIOUSINJURIES <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ FATALITIES      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ INCDATE         <date> 2013-04-02, 2013-03-30, 2006-10-24, 2005-07-28,…
$ INCDTTM         <fct> 4/2/13 15:10, 3/30/13 14:00, 10/24/06 10:01, 7/2…
$ JUNCTIONTYPE    <fct> Mid-Block (not related to intersection), Mid-Blo…
$ SDOT_COLCODE    <int> 11, 11, 11, 11, 51, 11, 14, 11, 11, 14, 11, 11, …
$ DISTRACTED      <fct> N, N, N, N, Y, N, Y, Y, N, Y, N, N, Y, N, N, N, …
$ DUI             <fct> N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, …
$ WEATHER         <fct> Overcast, Clear or Partly Cloudy, Raining, Clear…
$ ROADCOND        <fct> Dry, Dry, Wet, Dry, Dry, Dry, Dry, Wet, Dry, Dry…
$ LIGHTCOND       <fct> Daylight, Daylight, Daylight, Daylight, Dusk, Da…
$ SPEEDING        <fct> N, N, Y, N, N, N, N, N, N, N, N, N, N, N, N, N, …
$ Year            <fct> 2013, 2013, 2006, 2005, 2013, 2006, 2004, 2005, …
$ Month           <fct> 4, 3, 10, 7, 4, 2, 9, 6, 4, 8, 4, 8, 11, 7, 4, 3…
$ Time            <dttm> 2019-05-14 15:10:00, 2019-05-14 14:00:00, 2019-…
$ Hour            <int> 15, 14, 10, 11, 19, 0, 12, NA, 19, 7, 8, NA, 8, …
$ DayPart         <fct> Afternoon, Afternoon, Morning, Morning, Evening,…
$ Season          <fct> Spring, Spring, Fall, Summer, Spring, Winter, Fa…
$ SDOTtype        <fct> Sideswipe, Sideswipe, Sideswipe, Sideswipe, Stru…
count_nulls(df)
        null_count
Time         50292
Hour         50292
DayPart      50292


Univariate Plots

How have the total number of collisions changed over the years?

ggplot(data = df_full_yrs, aes(x = Year)) +
  geom_bar()  +
  scale_y_continuous(breaks = seq(0, 18000, 2000))

The distribution of collision frequency by year appears bimodal, peaking at 2005 and, to a lesser extent, 2015, with a low at 2010 and again in 2018. This plot hints at a possible 5-year pattern.

I had expected to see a steady increase in Seattle collisions, mirroring population growth. Perhaps I’m wrong about the steady increase in population, or that the number of reported collisions have a strong correlation with the population in the first place.


Do collision frequencies change throughout the day?

ggplot(data = df %>% count(Hour), aes(x = Hour, y = n )) + 
  geom_line() + 
  ylim(0,15000)

It appears that collision counts tend to rise during commute hours, starting at the lowest point around 4am.


What are the COLLISION types, according to the SPD?

Our dataset includes 2 Collision type (category) variables:

  1. COLLISIONTYPE reported by SPD (Seattle Police Department)

  2. SDOTtype recorded by SDOT (Seattle Department of Transportation)

# TODO: print % instead of totals, x-axis on top

ggplot(data = df,
       aes(x = reorder(COLLISIONTYPE, COLLISIONTYPE, function(x) length(x) * 1)  )) +
  geom_bar(width = .95)   +
  mylayer_count_above +
  mylayer_long_xlabel +
  labs(x = 'SPD COLLISIONTYPE', y = 'Count') +
  coord_flip()


What are the collision types as recorded by SDOT?

# TODO: print % instead of totals, x-axis on top

ggplot(data = df,
       aes(x = reorder(SDOTtype, SDOTtype, function(x) length(x) * 1))) +
  geom_bar(width = .95)  +
  mylayer_count_above +
  mylayer_long_xlabel +
  labs(x = 'SDOTtype', y = 'Count')+
  coord_flip()


SPD has 11 collision categories while SDOT‘s has 7. Which type I choose to include in this study could have an impact on our analysis. For example, if we are analyzing collisions involving Pedestrians, SPD reports only 1,770 ’Pedestrian’, while SDOT reports ‘Hits Pedestrian’ 17,846 times. That’s a nearly 10x difference. Also, SDOT’s ‘Rear End’ count is almost twice as many as SPD’s.

It is possible that SPD has reason to categorize collisions differently than SDOT. The reasons are outside the scope of this project. Nonetheless, it would be interesting to look into how Collision types are recorded, and if it would make a differences in our analysis.

In this report, except where otherwise noted, will be using SDOT’s Collision categories.


What are the SEVERITY categories?

# TODO: remove decimal in %
severity_level_order <- c('Unknown','Property Damage Only', 'Injury', 'Serious Injury', 'Fatality' )

ggplot(df, 
       aes(x = factor(SEVERITYDESC, level = severity_level_order))) +
  geom_bar(aes(y = (..count..)/sum(..count..))) +
  scale_y_continuous(labels = percent_format()) +
  labs(x = 'SEVERITYDESC', y = 'Percent')+
  coord_flip()

Observations:

  • This plot orders the Severity bins in order of worsening severity.

  • The frequency of collisions by Severity is right skewed (towards higher severity).

  • Over 60% of collisions result in Property Damage Only (no injuries or deaths).

The Severity factor alone does no quantify the damage done by collisions. We could make a rough attempt to quantify damage by using columns VEHCOUNT, INJURIES, SERIOUSINJURIES and FATALATIES in a calculation, but it would be a crude quantification.


Geographical, environmental and time factors

Let’s get a rough idea of how counts are distributed amongst these types of categorical factors. Don’t worry about not being able to read the x-axis labels yet :-> .

Observations:

  • Location (AddrType and JunctionType) each have a two predominant types. We might dig into the these two variables later, to see if JunctionType is a subset of AddrType.

  • Environmental conditions (Weather, LightCond, RoadCond) each have a dominant categories. Earlier we saw that most collisions happen in the middle of the day, so I suspect that is represented here (Clear, Daylight, Dry).

  • Time factor (Season, DayPart) frequencies a more evenly spread, with the highest frequency in Summer Afternoons.


The Counts of Things

These are the quantitative variables, which are simply counts of things. Let’s see what they are and how they’re spread out.


Since many of the Counts of Things have 0 value, plot them again but this time with log10 so non-zero values are more visible.

    VEHCOUNT       PERSONCOUNT        PEDCOUNT        PEDCYLCOUNT     
 Min.   : 0.000   Min.   : 0.000   Min.   :0.00000   Min.   :0.00000  
 1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.:0.00000   1st Qu.:0.00000  
 Median : 2.000   Median : 2.000   Median :0.00000   Median :0.00000  
 Mean   : 1.734   Mean   : 2.229   Mean   :0.03698   Mean   :0.02685  
 3rd Qu.: 2.000   3rd Qu.: 3.000   3rd Qu.:0.00000   3rd Qu.:0.00000  
 Max.   :15.000   Max.   :93.000   Max.   :6.00000   Max.   :2.00000  
    INJURIES      SERIOUSINJURIES      FATALITIES      
 Min.   : 0.000   Min.   : 0.00000   Min.   :0.000000  
 1st Qu.: 0.000   1st Qu.: 0.00000   1st Qu.:0.000000  
 Median : 0.000   Median : 0.00000   Median :0.000000  
 Mean   : 0.374   Mean   : 0.01507   Mean   :0.001656  
 3rd Qu.: 1.000   3rd Qu.: 0.00000   3rd Qu.:0.000000  
 Max.   :78.000   Max.   :41.00000   Max.   :5.000000  


The Counts of Things frequencies are all right-skewed. Only the VEHCOUNT and PERSONCOUNT peak at 2 instead of 0.


Let’s look closer at Vehicle and Person counts …

VEHCOUNT

“The number of vehicles involved in the collision. This is entered by the state.”

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   2.000   2.000   1.734   2.000  15.000 

The vast majority of collisions involve 2 vehicles. The mean, median, Q2 and Q3 are all roughly the same (2), as verified with the flat box plot.


PERSONCOUNT

“The total number of people involved in the collision”

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   2.000   2.000   2.229   3.000  93.000 


Observations: * Most collisions involve 2 people, the next highest-count being 3 persons. * Some reportedly have zero PERSONCOUNT. * Median and Q1 are the same (2), with Q3 being just one away, then a very long tail to a maximum of 93. * There is no data available to explain the circumstances of the larger PERSONCOUNTs.


How about driver diminished capacity and error (DUI, DISTRACTED, Speeding)?


Observations:

  • DISTRACTED is reported more often then DUI.

  • Surprisingly, SPEEDING is infrequently noted in collision reports.


Univariate Analysis

What is the structure of your dataset?

There were 205,381 collisions reported by SPD between the years 2004 and 2018.

Most of the variables are categorical, describing:

  • environmental factors
  • location factors
  • points in time

Binary variables (Yes or No) include:

  • DISTRACTED
  • DUI
  • Speeding

Quantitative variables include counts of:

  • bicycles
  • vehicles
  • people
  • injuries
  • fatalities

Other observations:

Most collisions:

  • happen in a ‘block’ location (as opposed to intersection or alley)
  • involve 2 vehicles
  • occur on a clear or partly cloudy day, during daylight on a dry road
  • are categorized as a ‘Rear End’ or ‘Sideswipe’
  • usually involve property damage only, as opposed to injury or fatality

What is/are the main feature(s) of interest in your dataset?

The main features of interes include:

  • Time: are there trends across the years, or during an average day?
  • DUI: are there patterns to DUI collisions?
  • Collision Type: Possible disparities between SPD’s and SDOT’s classification of collision types.
  • Severity: What kind of collisions have higher Severity.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Counts of Things may help explain possible SPD and SDOT collision classification disparities.

Counts of Things may also help explain Severity classifications.

Also, geographical and environmental variables may help support the investigation of main features.

Did you create any new variables from existing variables in the dataset?

Yes. I created point-in-time variables to help make reporting and graphing easier:

  • Year
  • Season
  • Hour
  • DayPart (morning, afternoon, etc)

I created a new collision categorical variable (SDOTtype) to group common SDOT’s collision types. For example, I combined 6 types of ‘Sideswipe’ collision types into one category called ‘Sideswipe’.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I log-transformed the strongly right-skewed Counts of Things distributions (vehicles, people, injuries, etc). Both the vehicle and person counts peaked at 2 while all other counts peaked at 0. Person count, injuries and serious injuries have a long right tail. The log10 transformation made it easier to see counts other than the left-side peaks.

Other changes:

  • Original values if DUI included: 0, 1, Y, N and null. In order to standardize the values, I changed 0 to N and 1 to Y.

  • I deleted rows with Incident Dates before 2004 or after 2018. Original records outside that range contained only partial-year records.

  • There was a single record with a null value for SEVERITYCODE. I first verified there were no counts of injuries or fatalities then set the value to 0.

  • Original WEATHER values included both ‘Unknown’ and NULL. I standardized by combining both as ‘Unknown’.

  • Original SEVERITYDESC string values were postfixed with ‘Collision’, which made labels unnecessarily long, so I removed those postfixes.

  • I deleted columns from the dataset that were not of interest for this study (such as report numbers and administrative codes.)

  • I renamed two columns for easier programming and plot reading:

  • DUI <- UNDERINFL (“Whether or not a driver involved was under the influence of drugs or alcohol.”)
  • DISTRACTED <- INATTENTIONID (“Whether or not collision was due to inattention. (Y/N)”)


Bivariate Plots Section

From this correlation chart, we see moderately strong relationships between:

**DUI* and …

Vehicle Count and …

Let’s view these stronger correlations in a couple different ways …


Correlations table

                DUI VEHCOUNT SEVERITYCODE COLLISIONTYPE WEATHER
DUI            1.00     0.05         0.10         -0.04   -0.09
VEHCOUNT       0.05     1.00         0.38         -0.38   -0.50
SEVERITYCODE   0.10     0.38         1.00         -0.40   -0.43
COLLISIONTYPE -0.04    -0.38        -0.40          1.00    0.38
WEATHER       -0.09    -0.50        -0.43          0.38    1.00
PERSONCOUNT    0.05     0.55         0.37         -0.23   -0.32
ROADCOND      -0.04    -0.30        -0.25          0.22    0.76
LIGHTCOND     -0.20    -0.48        -0.42          0.39    0.59
              PERSONCOUNT ROADCOND LIGHTCOND
DUI                  0.05    -0.04     -0.20
VEHCOUNT             0.55    -0.30     -0.48
SEVERITYCODE         0.37    -0.25     -0.42
COLLISIONTYPE       -0.23     0.22      0.39
WEATHER             -0.32     0.76      0.59
PERSONCOUNT          1.00    -0.19     -0.31
ROADCOND            -0.19     1.00      0.30
LIGHTCOND           -0.31     0.30      1.00


Easy-reference plot


It is clear that DUI and VEHCOUNT have the strongest relationships.

create_summary <- function(data, col1, col2) { col1 <- enquo(col1) col2 <- enquo(col2) result_summary <- data %>% group_by(!! col1, !! col2) %>% summarise(count = n()) %>% mutate(perc = count/sum(count)) mutate(label = percent(perc %>% round(5))

result_summary }


COLLISIONTYPE and VEHCOUNT

COLLISIONTYPE and VEHCOUNT have a moderately-strong relationship (.5). Let’s see what they look like.

It appears that ‘Rear End’ collisions involve noticably more vehicles in single incidents. This chart shows percentages, not counts, so we do not know from this plot just how many collisions involve these multiple vehicles.


Has the rate of DUI collisions changed over the years?

Interestingly, the lowest recorded DUI rate was the same year as our second-highest total collision year (2015). What happened in 2015 to cause such a drop?

However, when we plot the rate of DUI in perspective with all collisions, the DUI rate changes lose their punch and appear relatively steady.


Let’s look at the same plot for DISTRACTED.

Although DISTRACTED was not strongly correlated with other factors, we did see in the Univariate section that DISTRACTED has a similar rate as DUI, so let’s see how the DISTRACTED rate changed over the years.

There was a significant increase in DISTRACTED being a factor in collisions, starting in the year 2013. Perhaps this is a result of the increase in the use of mobile devices, or directives for the SPD to record ‘DISTRACTED’ as a factor in collisions. More data is required to do any more analysis, outside the scope of this study.


SPD Reports vs SDOT Records

In the Univariate section, we saw a possible disparity between SPD’s reporting of collision factors and SDOT’s follow-up recording of these factors. Let’s look into this a little more.


View By Count

It’s interesting that SPD reports categorizing a collision as ‘Sideswipe’, eventually gets categorized by the SDOT as ‘Other Collision’ or ‘Rear End’. Also, it appears that SPD ‘Pedestrian’ labels frequently show up in SDOT records as ‘Head On’.


View By Percent

View the same information as the plot above, but show the percentages of COLLISIONTYPE’s within each SDOTtype, instead of count:

Plotting by percentage makes it easier to see the spread of COLLISIONTYPEs. For example, we can now easily see that SPD ‘Sideswipe’ (yellow) are also found in SDOT ‘Hits Pedestrian’ records.

Also, we can now see that SPD reported collisions involving Cycles are spread-out over several SDOT categories.

It is possible that SDOT reclassifies Collision Types after all the dust settles and more facts are available. It would be interesting to dig a little deeper into reclassifications.


Let’s do a similar study of JUNCTIONTYPE vs ADDRTYPE.

It appears that the the reportings of JUNCTIONTYPE and ADDRTYPE classifications are fairly in sync, that JUNCTIONTYPE is used as a sub-category to ADDRTYPE.


Has the distribution of any Collision types rates changed over the years?

Surprisingly, it appears the % of collisions with ‘Hits Pedestrian’ rose significantly around the years where there were fewer total collisions (2010-2011). Are there more pedestrians on the road, and fewer drivers? Maybe more inattentive pedestrians getting hit? That would be an interesting further study, but outside the scope of this project since DISTRACTED does not specify who was inattentive.

It will be interesting to look more into the Hits Pedistrian increase.


Adjusted Hits Pedestrian

There are several variables that indicate a Pedestrian was involved in a collision (COLLISIONTYPE, SDOTtype, PEDCOUNT). Are all pedestrian-related collisions being marked by SDOT as ‘Hits Pedestrian’?

Let’s Bundle all records with any Pedestrian indicator into ‘Hits Pedestrian’. Will the increased trend of ‘Hits Pedestrian’ records stay the same, or even out across the years?

Here we see approximately 25% of the records with some indication of Pedestrian involvement were originally classified as ‘Head On’. This suggests that most pedestrians are hit with the front of a car, as opposed to sideswiped or rear-ended.

If we re-classify all the above as ‘Hits Pedestrian’, will the trend change?

Comparing the original and the Adjusted plots, it appears that the ratios are pretty much stay the same, except some of the ‘Head On’ records got moved to ‘Hits Pedestrian’.

Theory: Whatever causes the SPD and SDOT discrepancies in ‘Pedestrian’ reporting appears to be consistent, and the increased rate in ‘Hits Pedestrian’ is true (we cannot reject the null hypothesis that pedestrian’s getting hit rate did not change.)


DUI relationships


SEVERITY

The proportions appear similar between non-DUI and DUI, in that the more serious the Severity, the fewer the collisions count, whether DUI is true or not.

What if we did look at the proportions instead of counts, will the proportions still appear the same, as suggested as in the plot above? While we’re at it, let’s compare proportions for the DISTRACTED variable.

The proportions of Severity for DUI are similar but not the same. Here we see there are more serious severities in DUI collisions.

Interestingly, the rates of Severity between DUI and DISTRACTED indicators are very similar, although Severity tends to be a worse in DUI collisions.

Later, let’s see if other factors come into play with the seriousness of DUI. For example, perhaps more DUI collisions occur at night, when people tend to inbibe.

 

Out of curiosity, how often are DUI and DISTRACTED both marked as factors on the same Collision report?

Most reports with any DUI value (‘Y’ or ‘N’), DISTRACTED is not marked. All reports with no DUI value, DISTRACTED also has no value.

This plot was just for curiosity, I will no longer pursue DISTRACTED analysis.


Other DUI relationships

Light Conditions has the most significant differences between non-DUI and DUI. There are significantly more collisions in the ‘Dark - Street Lights On’ when DUI. Whether or not this is simply a factor of time (night time being when the streetlights are on and, perhaps, when more people are DUI), and less to do with the fact that the street lights are on, is a question requiring further study.

Although Weather Condition had a strong correlation (.6) with DUI, there does not appear to be a significantly difference of Weather in the non-DUI collisions. Although, it is worth noting that very few collision reports neglected to report the WEATHER (‘Unknown’) for DUI collisions. This may be another indication that SPD reports are more completely filled-out when DUI is suspected.

Collision Type of ‘Other’ is more than double when DUI is True, but Sideswipes are significantly less, as are ‘Hits Pedestrian’.


Let’s look a little closer at the SDOTtype and LIGHTCOND plots in larger scale.

DUI and SDOTtype

It would be interesting to break-down ‘Other’ Collision Types for more details.

DUI and LIGHTCOND

Is it possible that glare from Street Lights could be a factor in DUI collisions?
(digging into this more is way outside the scope of this project)


How are counts by LIGHTCOND spread out across the day?

Higher counts by LIGHTCOND at certain parts of the day hold no surprises: collisions in afternoons will naturally tend to be during Daylight.

It is interesting that most collisions occur during the afternoon. Is road volume a contributing factor?


LIGHTCOND and TIME line graph.

Let’s look at the same data, in a line graph format. It may be easier to see what’s going on. Let’s also look at percentages instead of counts.

Again, no surprises here … LIGHTCOND and Hour are paired.


Is there a change in hourly-collision patterns between the 2-highest and 1-lowest years?

2015, our second-hightest ranking year, has a more distinctive peak during the lunch hours, as well as overall increase during the day. Interestingly, this same hear saw a smaller midnight collision count than the other two years.

2005, our lowest year, had a small drop in the morning commute hours that the other two years did not. Plus, there is a plateau of collision counts around the lunch hour that the other years did not have, and the spike in the evening commute is less pronounced.


Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Collision Types and Vehicle Counts

  • There are a few patterns to Collision types and the number of vehicles that tend to be involved in each type. Apparently, ‘Sideswipe’ incidents involve anywhere bewteen 2 and 9 vehicles. Most ‘Pedestrian’ incidents involve just one car, but there is at least one ‘Pedestrian’ that involved 7 cars.

DUI and other factors

  • The total number of DUI reports have changed over the years, but the rate of change is not dramatic.

  • In the span of the average day, there is a definite trend in that DUI’s generally occur in the evening and night hours.

  • DUI’s are involved in fewer ‘Sideswipe’ and ‘Pedestrian’ collision types, and more ‘Other’, suggesting more digging into ‘Other’ is needed.

  • DUI incidents tend to have higher severities.

SPD vs SDOT Classifications

  • There is a difference between SPD’s and SDOT’s categorization of collisions, which is interesting but appear to be consistent. It was SPD’s COLLISIONTYPE that proved to have stronger correlations with other factors, but for the sake of simplicity in my analysis, I chose to continue using SDOT’s categories in my plots.

  • There appeared to be anomalies when it came to SPD’s reporting of Pedestrian related collisions, and SDOT’s recordings of them. It turns out that whatever is happening between reporting and recording is consistent. It appears that SDOT categorizes some Pedestrian hits as ‘Head On’ as opposed to ‘Hits Pedestrian’.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

  • DISTRACTED increased significantly starting around 2013.

  • Most collisions happen in LIGHTCOND = ‘Daylight’, and during the afternoons.

  • I’m surprised that LIGHTCOND did not have a higher score with Hour in the correlation tables, and had a weak relationship with DayPart.

  • It appears that JUNCTIONTYPE is a subset of ADDRTYPE.

  • Weather (surprisingly, the cor value is high, but there is little difference between DUI and non-DUI weather, other than fewer DUI records are marked with WEATHER=‘Unknown’)

  • Light Condition (not surprisingly, DUI collisions are mostly in Street Light conditions)

What was the strongest relationship you found?

In general, DUI and VEHCOUNT each have the strongest relationships with other features.

The strongest relationship was between DUI and LIGHTCOND. The LIGHTCOND of ‘Street Lights On’ more than doubles when DUI is true. However, this may very well be a ‘correlation is not causation’ example, in that DUI people may more frequently be driving at night when street lights just happen to be on.


Multivariate Plots Section

In this section, I dig deeper into exploring relationships between:


How does the rate of DUI collisions changes compare to total collisions over the years?

Although DUI rates change very little over the years, they do seem to drop in rate when the total number of collisions rise.

/

When DUI’s happen, compared to when Injuries happen.

This plot suggests that, although the total number of collisions between midnight and 4am are at their lowest, most of these collisions can be attributed to DUIs.

Also, the number of serious injuries increases during commute hours (4-6pm), and the number of fatalities his highest around 6pm, and around 9pm.


What kind of collisions involve fatalities, and on average how many annually?

In collisions that involved fatalities, on average most were ‘Hits Pedestrian’ and ‘Other Collision’. There were a few outliers in each of ‘Sideswipe’ and ‘Other Collision’, with more than 2x the annual average fatalities in certain years. Not surprisingly, there were the fewest fatalities in ‘Struck Object’.

More analysis would be required to figure out what ‘Other Collision’ involves. Looking at SPD’s COLLISIONTYPE helps, but they even have a significant ‘Other’ category with a significant number of ‘Other’ fatalities. But we can see here that some annual fatalities involve ‘Cycles’ and ‘Head On’:


SPD vs SDOT with VEHCOUNTS

In the Bivariate section, we explored the SPD’s Collision relationship with VEHCOUNT. Now let’s also compare VEHCOUNT with SDOT’s Collision categories, to see what the differences are, if any.

Where as the SPD tends to not categorize all the VEHCOUNT=0 reports (COLLISIONTYPE=‘’), SDOT follows these incidents up by assigning their own categorization to the incident. Also, in the VEHCOUNT=12 column (there may have been only 1 VEHCOUNT=12 record), the SPD marked it as ’Parked Car’ while SDOT followed-up and categorized the incident(s) as ‘Hits Pedestrian’. There are a number of other inconsistencies that would require further analysis outside the scope of this project.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The DUI / Light Condition / Severity relationship is strong: DUI collisions tend to happen when ‘Street Lights - On’, and they also tend to have higher severities.

Also, DUI / Time of Day / Severity relationships are strong, in that LIGHTCOND and Time are strongly paired.

It appears that if there is an incident between approximately 12am and 4am, it will most likely involve a DUI.

Looking averages across the years, reports suggest that most fatalities occur with ‘Hits Pedestrian’ and ‘Other’.

Were there any interesting or surprising interactions between features?

The disparity between SPD and SDOT collision categorization is made clearer when looking at Fatality averages. It is unclear why SDOT categorizes some ‘Other’ collisions when SPD categorizes them as ‘Head On’, and some of those may involve pedestrians.

I am surprised to see that (generally) when annual DUI rates are up, total collision incident counts are down.


Final Plots and Summary

Description One

I wanted to introduce the Seattle Collisions dataset with a simple visual illustrating collision incident trends over time (across the Years, and through the course of an average day.)

The By Year plot shows the presence of a repeated rise-and-fall trend, highlighting 2 peak years (2005 and 2015) and 1 valley year (2010).

The By Time of Day presents a simple trend of collision frequencies throughout any given day, with a trend of increasing incidents from morning until afternoon commute hours.

Design justifications

I chose to use a 2-column grid, as opposed to a 1-column/2-row grid, so the y-axis would be taller. I did this because the range of incident counts (y-axis) is fairly large (up to 15,000), and wanted the plot to represent the large change in counts from year to year. If I used a 1-column grid, the difference in years would presented as subtle. If I had used the more subtle chart, I felt the need to print a rate changes on each bin to show the differences, which proved to be too distracting.

I debated whether to include a point-line in the By Time of Day plot, to show where points actually sat in relation to the smoother slope. I decided to keep the plot simple, as it’s purpose was to simply show the general trend of collision incidents throughout the day.

I chose a grey color scheme to keep the charts simple. I did not want colors to distract from the main point (the trends). However, I did vary a scale of grey on the 2-highest and 1-lowest incident count By Years to draw the audience’s attention to those extremes.


Plot Two - Categorical Differences

Description Two

I chose this plot because it shows the disparity between how the SPD (Seattle Police Department) and SDOT (Seattle Department of Transportation) choose to categorize each incident. For example, the left two bins show that when SDOT categorizes an incident as ‘Heads On’, the SPD had already categorized it as ‘Pedestrian’. The SPD and SDOT had the same classification for very few of the ‘Heads On’ and ‘Pedestrian’ labels. Less than 50% of the time they agreed on ‘Rear Ended’ classifications.

It is possible that both systems of categorization are coordinated intentionally, that the absense of matching values is not of disagreement but that by using both we get better picture of each incident. It may also suggest the dataset could be enhanced if each category was a boolean factor, as opposed to two single-value factors.

Design justifications

I chose bright colors (from a colorblind-friendly color scheme) to emphasize the SPD vs SDOT disparities. I hope the audience’s first impression to see the presense of a complication, and interesting enough to spend a few moments comparing the SPD vs SDOT incident classifications.


Plot Three - DUI Severities

Description Three

I chose this plot to illustrate the increased rate of injury and death in DUI incidents versus non-DUI incidents. DUI incidents have a 10 times greater chance of resulting in Fatalities, and more than 3 times greater chance of Serious Injuries.

Design justifications

I debated whether or not to use points in this plot, since the number of DUI incidents are far fewer than non-DUI, therefore resulting in fewer points on the plot (even though the top three rates are higher). I decided to include the points because I wanted to demonstrate that, even though there are fewer DUI incidents, their rate of fatalities and injuries are much higher.

I also removed the color legend because it would have been redundant with the y-axis labels.


Reflection

The dataset

I wanted to challenge myself with this project by finding my own dataset to use.

I initially thought the Seattle Collisions was clean. While exploring the data, I progressively realized that was not quite true. I discovered the hard way that cleaning does take significant time (as warned in the project instructions!) Not just the time for writing code, but the time it takes to dig deeper into the data in order to make ethical decisions on how best to clean while avoiding the unintentional misrepresention of data.

Also, this dataset has very few quantitative factors, making it challenging to create a variety of graph types. That being said, I learned more about plotting in R than I may not have otherwise. I learned quite a bit by experimenting with color, scale, labels, and which plot types to use with which data type.

I enjoyed the ‘running commentary, stream of thought’ nature of this data exploration project. By not strategizing too much ahead of time exactly what I wanted to conclude at the end, I made a few surprising discoveries.

The data

I was surprised that annual collision counts did not steadily rise over the years, but has a distinct rise/fall pattern. Additional analysis in future work should introduce Seattle Population estimates, to see how it correlates with collision data, and patterns of incident factors over time.

I was not surprised there is an increase in the rate if ‘DISTRACTED’ collisions, given the rise in the use of mobile devices. This may also be an interesting study for future work. What would ‘Innattention’ patterns reveal with more study of how (and how often) mobile devices are used by car occupants as well as pedestrians and cyclists.

Looking into the SPD vs SDOT categorizations made it more real to me how important it is to know how data is collected and stored before it even gets to the analyst, otherwise it’s difficult to make any ascertions about the data and what it represents. For example, are the SPD and SDOT categorizations of each incident a coordinated effort to describe an incident, or is data collection inconsistent or incomplete?